The data for this report were collected from two secondary schools and from two classes, a math class and a Portuguese language class. The factors presented in the data set are social factors and should support an in-depth analysis of teen alcohol consumption.
With this information we will be able to determine whether alcohol consumption is an issue for the students in these secondary schools, and how social factors relate to the consumption of alcohol. We assume the questions below will help identify a relationship between alcohol consumption, student performance, free time, interest in attending a university, and factors based on family cohabitation status and health status. Finding a relationship between these factors will allow us to predict the likelihood of a student consuming alcohol. We will also compare key differences in alcohol consumption between the math and language classes.
# Please install packages if not in library
library(ggplot2)
library(plyr)
library(grid)
library(gridExtra)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(skimr)
library(tidyr)
# Read student-mat.csv and student-por.csv into two data frames
mat_students = read.csv("student-mat.csv")
por_students = read.csv("student-por.csv")
# Merge math and Portuguese class together into one data-frame
mat_por_students = rbind(mat_students, por_students)
# Check for missing data and omit any incomplete rows
if (sum(is.na(mat_por_students)) != 0) {
mat_por_students <- na.omit(mat_por_students)
}
sum(is.na(mat_por_students))
## [1] 0
dim(mat_por_students) # dim gives dimensions of object
## [1] 1044 33
The data set contains 33 variables and 1044 observations.
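Because the report goes on to compare the math and Portuguese classes, it may help to record which file each row came from before the two data frames are stacked. The following is only a minimal sketch on two tiny made-up data frames; the column name `class` and the toy values are assumptions for illustration and are not part of the original data.

```r
# Toy stand-ins for student-mat.csv and student-por.csv (values are made up)
mat_toy <- data.frame(gender = c("F", "M"), age = c(16L, 17L))
por_toy <- data.frame(gender = c("F", "F", "M"), age = c(15L, 18L, 16L))

# rbind() needs the same columns in both data frames, so check first
stopifnot(identical(names(mat_toy), names(por_toy)))

# Tag every row with its class of origin (hypothetical column name)
mat_toy$class <- "math"
por_toy$class <- "portuguese"

# Stack the two classes and count rows per class
combined <- rbind(mat_toy, por_toy)
table(combined$class)
```

With such a column in place, later tables and plots could be split by class, for example with `table(combined$class, combined$gender)`.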
summary(mat_por_students)
## school gender age address family.size
## GP:772 F:591 Min. :15.00 R:285 GT3:738
## MS:272 M:453 1st Qu.:16.00 U:759 LE3:306
## Median :17.00
## Mean :16.73
## 3rd Qu.:18.00
## Max. :22.00
## parents.cohabitation.status mothers.education fathers.education mothers.job
## A:121 Min. :0.000 Min. :0.000 at_home :194
## T:923 1st Qu.:2.000 1st Qu.:1.000 health : 82
## Median :3.000 Median :2.000 other :399
## Mean :2.603 Mean :2.388 services:239
## 3rd Qu.:4.000 3rd Qu.:3.000 teacher :130
## Max. :4.000 Max. :4.000
## fathers.job reason.for.school.selection guardian traveltime
## at_home : 62 course :430 father:243 Min. :1.000
## health : 41 home :258 mother:728 1st Qu.:1.000
## other :584 other :108 other : 73 Median :1.000
## services:292 reputation:248 Mean :1.523
## teacher : 65 3rd Qu.:2.000
## Max. :4.000
## studytime past.class.failures extra.school.support
## Min. :1.00 Min. :0.0000 no :925
## 1st Qu.:1.00 1st Qu.:0.0000 yes:119
## Median :2.00 Median :0.0000
## Mean :1.97 Mean :0.2644
## 3rd Qu.:2.00 3rd Qu.:0.0000
## Max. :4.00 Max. :3.0000
## family.education.support extra.paid.classes extra.curricular.activities
## no :404 no :824 no :528
## yes:640 yes:220 yes:516
##
##
##
##
## attended.nursery.school higher.education internet.access romantic.relationship
## no :209 no : 89 no :217 no :673
## yes:835 yes:955 yes:827 yes:371
##
##
##
##
## quality.of.family.relationships freetime going.out.with.friends
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :3.000 Median :3.000
## Mean :3.936 Mean :3.201 Mean :3.156
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
## weekday.alcohol.consumption weekend.alcohol.consumption health.status
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000
## Median :1.000 Median :2.000 Median :4.000
## Mean :1.494 Mean :2.284 Mean :3.543
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
## school.absences first.period.grade second.period.grade final.grade
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 9.00 1st Qu.: 9.00 1st Qu.:10.00
## Median : 2.000 Median :11.00 Median :11.00 Median :11.00
## Mean : 4.435 Mean :11.21 Mean :11.25 Mean :11.34
## 3rd Qu.: 6.000 3rd Qu.:13.00 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :75.000 Max. :19.00 Max. :19.00 Max. :20.00
skim(mat_por_students) # Skim is an alternative to summary, which provides a broad view of the dataframe.
| Name | mat_por_students |
| Number of rows | 1044 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| factor | 17 |
| numeric | 16 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| school | 0 | 1 | FALSE | 2 | GP: 772, MS: 272 |
| gender | 0 | 1 | FALSE | 2 | F: 591, M: 453 |
| address | 0 | 1 | FALSE | 2 | U: 759, R: 285 |
| family.size | 0 | 1 | FALSE | 2 | GT3: 738, LE3: 306 |
| parents.cohabitation.status | 0 | 1 | FALSE | 2 | T: 923, A: 121 |
| mothers.job | 0 | 1 | FALSE | 5 | oth: 399, ser: 239, at_: 194, tea: 130 |
| fathers.job | 0 | 1 | FALSE | 5 | oth: 584, ser: 292, tea: 65, at_: 62 |
| reason.for.school.selection | 0 | 1 | FALSE | 4 | cou: 430, hom: 258, rep: 248, oth: 108 |
| guardian | 0 | 1 | FALSE | 3 | mot: 728, fat: 243, oth: 73 |
| extra.school.support | 0 | 1 | FALSE | 2 | no: 925, yes: 119 |
| family.education.support | 0 | 1 | FALSE | 2 | yes: 640, no: 404 |
| extra.paid.classes | 0 | 1 | FALSE | 2 | no: 824, yes: 220 |
| extra.curricular.activities | 0 | 1 | FALSE | 2 | no: 528, yes: 516 |
| attended.nursery.school | 0 | 1 | FALSE | 2 | yes: 835, no: 209 |
| higher.education | 0 | 1 | FALSE | 2 | yes: 955, no: 89 |
| internet.access | 0 | 1 | FALSE | 2 | yes: 827, no: 217 |
| romantic.relationship | 0 | 1 | FALSE | 2 | no: 673, yes: 371 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 16.73 | 1.24 | 15 | 16 | 17 | 18 | 22 | ▇▅▅▁▁ |
| mothers.education | 0 | 1 | 2.60 | 1.12 | 0 | 2 | 3 | 4 | 4 | ▁▅▇▆▇ |
| fathers.education | 0 | 1 | 2.39 | 1.10 | 0 | 1 | 2 | 3 | 4 | ▁▆▇▆▆ |
| traveltime | 0 | 1 | 1.52 | 0.73 | 1 | 1 | 1 | 2 | 4 | ▇▅▁▁▁ |
| studytime | 0 | 1 | 1.97 | 0.83 | 1 | 1 | 2 | 2 | 4 | ▅▇▁▂▁ |
| past.class.failures | 0 | 1 | 0.26 | 0.66 | 0 | 0 | 0 | 0 | 3 | ▇▁▁▁▁ |
| quality.of.family.relationships | 0 | 1 | 3.94 | 0.93 | 1 | 4 | 4 | 5 | 5 | ▁▁▂▇▅ |
| freetime | 0 | 1 | 3.20 | 1.03 | 1 | 3 | 3 | 4 | 5 | ▁▃▇▆▂ |
| going.out.with.friends | 0 | 1 | 3.16 | 1.15 | 1 | 2 | 3 | 4 | 5 | ▂▆▇▆▃ |
| weekday.alcohol.consumption | 0 | 1 | 1.49 | 0.91 | 1 | 1 | 1 | 2 | 5 | ▇▂▁▁▁ |
| weekend.alcohol.consumption | 0 | 1 | 2.28 | 1.29 | 1 | 1 | 2 | 3 | 5 | ▇▅▅▃▂ |
| health.status | 0 | 1 | 3.54 | 1.42 | 1 | 3 | 4 | 5 | 5 | ▃▂▅▃▇ |
| school.absences | 0 | 1 | 4.43 | 6.21 | 0 | 0 | 2 | 6 | 75 | ▇▁▁▁▁ |
| first.period.grade | 0 | 1 | 11.21 | 2.98 | 0 | 9 | 11 | 13 | 19 | ▁▂▇▇▂ |
| second.period.grade | 0 | 1 | 11.25 | 3.29 | 0 | 9 | 11 | 13 | 19 | ▁▂▇▇▂ |
| final.grade | 0 | 1 | 11.34 | 3.86 | 0 | 10 | 11 | 14 | 20 | ▁▂▇▆▁ |
str(mat_por_students)
## 'data.frame': 1044 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ family.size : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ parents.cohabitation.status : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ mothers.education : int 4 1 1 4 3 4 2 4 3 3 ...
## $ fathers.education : int 4 1 1 2 3 3 2 4 2 4 ...
## $ mothers.job : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ fathers.job : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason.for.school.selection : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime : int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ past.class.failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ extra.school.support : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ family.education.support : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ extra.paid.classes : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ extra.curricular.activities : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ attended.nursery.school : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher.education : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet.access : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic.relationship : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ quality.of.family.relationships: int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ going.out.with.friends : int 4 3 2 2 2 2 4 4 2 1 ...
## $ weekday.alcohol.consumption : int 1 1 2 1 1 1 1 1 1 1 ...
## $ weekend.alcohol.consumption : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health.status : int 3 3 3 5 5 5 3 1 1 5 ...
## $ school.absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ first.period.grade : int 5 5 7 15 6 15 12 6 16 14 ...
## $ second.period.grade : int 6 5 8 14 10 15 12 5 18 15 ...
## $ final.grade : int 6 6 10 15 10 15 11 6 19 15 ...
mean(mat_por_students$weekday.alcohol.consumption)
## [1] 1.494253
mean(mat_por_students$weekend.alcohol.consumption)
## [1] 2.284483
summary(mat_por_students$weekday.alcohol.consumption)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.494 2.000 5.000
summary(mat_por_students$weekend.alcohol.consumption)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.284 3.000 5.000
The “head” and “tail” functions give a view of the first six rows of data (head) and the last six rows of data (tail).
head(mat_por_students)
tail(mat_por_students)
Head: The first 6 rows of data consist of mostly female students, most of whose parents are still together; 3 of the mothers stay at home and all of the fathers work. Only 2 of the students have supplemental school support, and they all spend 2 or more hours studying a week. Additionally, all 6 students want higher education, and all have at least one alcoholic beverage per week.
Tail: The last 6 rows of data include 4 females, and all of the students' parents are together. In this case all of the mothers work, while 5 of the fathers work and 1 stays at home. The students in this sample have no supplemental school support and spend 1-3 hours studying each week. All students want higher education and all consume 1 or more alcoholic beverages a week.
We will use the table and ggplot functions to view the gender of students.
table(mat_por_students$gender)
##
## F M
## 591 453
ggplot(aes(x =age, fill=gender) , data = mat_por_students) + geom_histogram(binwidth=.1)+ ggtitle("Age and Gender of Students")
gender <- as.factor(mat_por_students$gender)
We will use a boxplot to compare the median age of females and males in the data set.
plot(gender,mat_por_students$age)
1. The median age of both female and male students in this data set is 17; the youngest students are 15 and, outliers aside, the oldest are 21. There is an outlier, a male student who is 22 years of age.
2. There are 138 more females than males in the data set (591 versus 453).
We will use table to review the number of students who have 1 or more alcoholic beverages on weekdays and on weekends.
table(mat_por_students$weekday.alcohol.consumption)
##
## 1 2 3 4 5
## 727 196 69 26 26
prop.table(table(mat_por_students$weekday.alcohol.consumption))
##
## 1 2 3 4 5
## 0.69636015 0.18773946 0.06609195 0.02490421 0.02490421
hist(mat_por_students$weekday.alcohol.consumption, main="Number of Alcoholic Beverages Consumed on Weekdays",xlab="Weekday alcohol consumption", border="blue", col="purple",ylab="Number of students")
table(mat_por_students$weekend.alcohol.consumption)
##
## 1 2 3 4 5
## 398 235 200 138 73
prop.table(table(mat_por_students$weekend.alcohol.consumption))
##
## 1 2 3 4 5
## 0.38122605 0.22509579 0.19157088 0.13218391 0.06992337
hist(mat_por_students$weekend.alcohol.consumption, main="Number of Alcoholic Beverages Consumed on Weekends",xlab="Weekend alcohol consumption", border="blue", col="yellow",ylab="Number of students")
On weekdays: about 70% of students (727) consume 1 alcoholic beverage, 19% consume 2 alcoholic beverages, and 7% consume 3. Approximately 5% of students consume 4 or 5 alcoholic beverages. On weekends: 38% of students (398) consume 1 alcoholic beverage, and we see an increase in consumption of 2 (23%) and 3 (19%) alcoholic beverages. About 20% of the students consume 4-5 alcoholic beverages on weekends.
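The percentages quoted here can be tabulated next to their counts so that the two cannot drift apart. This is a small sketch that reuses the weekday counts printed by `table()` earlier in this section; the object names are illustrative assumptions.

```r
# Weekday consumption counts copied from the table() output in this section
weekday_counts <- c("1" = 727, "2" = 196, "3" = 69, "4" = 26, "5" = 26)

# Share of students at each value, in percent, rounded to one decimal
weekday_pct <- round(100 * weekday_counts / sum(weekday_counts), 1)

# Counts and percentages side by side
data.frame(level = names(weekday_counts),
           students = as.vector(weekday_counts),
           percent = as.vector(weekday_pct))
```

The same pattern applies to the weekend counts (398, 235, 200, 138, 73).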
We will look at the difference of alcohol consumption by gender by creating tables and bar charts for categorical variables
table(mat_por_students$weekday.alcohol.consumption,mat_por_students$gender)
##
## F M
## 1 472 255
## 2 91 105
## 3 16 53
## 4 9 17
## 5 3 23
table(mat_por_students$weekend.alcohol.consumption,mat_por_students$gender)
##
## F M
## 1 270 128
## 2 150 85
## 3 116 84
## 4 44 94
## 5 11 62
ggplot(aes(x =weekday.alcohol.consumption, fill=gender) , data = mat_por_students) + geom_histogram(binwidth=.1)+ ggtitle("Weekday Alcohol Consumption by Gender")
ggplot(aes(x =weekend.alcohol.consumption, fill=gender) , data = mat_por_students) + geom_histogram(binwidth=.1)+ ggtitle("Weekend Alcohol Consumption by Gender")
There are 117 more females than males in the Portuguese language class; however, in the combined data more males than females consume 3 or more alcoholic beverages on weekdays (93 males versus 28 females). On the weekend, more than double the number of both males and females consume 3 or more alcoholic beverages (240 males and 171 females).
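Since the two genders are not equally represented, within-gender proportions may be easier to compare than raw counts. The sketch below rebuilds the weekday table from the counts printed in this section and computes, for each gender, the share of students at a value of 3 or above; the object names are assumptions for illustration.

```r
# Weekday counts by value (rows) and gender (columns), copied from the table in this section
weekday_by_gender <- matrix(c(472, 91, 16, 9, 3,
                              255, 105, 53, 17, 23),
                            ncol = 2,
                            dimnames = list(level = as.character(1:5),
                                            gender = c("F", "M")))

# Column-wise proportions: each gender's column sums to 1
within_gender <- prop.table(weekday_by_gender, margin = 2)

# Share of each gender at a value of 3 or above, in percent
high_share <- round(100 * colSums(within_gender[3:5, ]), 1)
high_share
```

On the real data frame, `prop.table(table(mat_por_students$weekday.alcohol.consumption, mat_por_students$gender), margin = 2)` would give the same proportions directly.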
We will use a pie chart to highlight the family size of students by GT3 (greater than 3) or LE3 (less than or equal to 3).
table(mat_por_students$age,mat_por_students$family.size)
##
## GT3 LE3
## 15 139 55
## 16 201 80
## 17 200 77
## 18 143 79
## 19 44 12
## 20 8 1
## 21 1 2
## 22 2 0
familytable <- table(mat_por_students$family.size)
lbls <- paste(names(familytable), "\n", familytable, sep="")
pie(familytable, labels = lbls,
main="Pie Chart of Family Size")
About 71% of students have a family size greater than 3, and about 29% have a family size of 3 or less.
The stacked bar chart will provide better insight into the gender of students and family size
ggplot(aes(x = family.size), data = mat_por_students) + geom_bar(aes(fill=gender))+ ggtitle("Family Size in Relation to Gender of Students")
We will plot parents' cohabitation status (A = apart, T = together) along with gender to see whether there is any relation between the variables.
ggplot(aes(x = quality.of.family.relationships) , data = mat_por_students) + geom_bar(aes(fill=gender)) + facet_wrap(~ parents.cohabitation.status)+ ggtitle("Parents Cohabitation Status and Gender")
ggplot(aes(x=studytime,fill=parents.cohabitation.status),data= mat_por_students) +geom_bar() +xlab('Weekly Study Time')+ scale_fill_manual(values=c( "#E69F00", "#56B4E9"))+ggtitle("Weekly Study Time and Parents Cohabitation Status")
table(mat_por_students$gender,mat_por_students$studytime)
##
## 1 2 3 4
## F 116 311 126 38
## M 201 192 36 24
# Participation in Extra-Curricular Activities
We will plot whether students are involved in extra-curricular activities (yes or no) and analyze the relation to students' grades.
table(mat_por_students$extra.curricular.activities,mat_por_students$gender)
##
## F M
## no 329 199
## yes 262 254
Out of 1044 students, about 49% (516) participate in extra-curricular activities, of whom about 51% are female and 49% are male.
ggplot(aes(x = final.grade), data = mat_por_students) + geom_bar(aes(fill=extra.curricular.activities))+ggtitle("Final Grade and Participation in Extra Curricular Activities")
From the graph above, participation in extra-curricular activities does not appear to have any effect on students' grades.
table(mat_por_students$higher.education,mat_por_students$gender)
##
## F M
## no 39 50
## yes 552 403
It can be observed that out of 1044 students, 955 (about 91%) wish to pursue higher education. They comprise 552 females (about 93% of all female students) and 403 males (about 89% of all male students).
ggplot(aes(x = studytime), data =mat_por_students) +
geom_histogram(aes(fill=gender), binwidth=0.1) +
facet_wrap(higher.education ~ address)+ ggtitle("Study Time and Higher Education by Urban/Rural Area")
ggplot(data=mat_por_students,aes(x=higher.education, y=final.grade))+ geom_boxplot() + facet_grid(~gender) + ggtitle("Higher education vs final grade of students")
mean(mat_por_students$past.class.failures)
## [1] 0.2643678
The average number of past class failures among the students is 0.26.
# Total alcohol consumption = combine weekday and weekend alcohol consumption of the students.
mat_por_students$total_alcohol_consumption = rowSums(cbind(mat_por_students$weekday.alcohol.consumption,mat_por_students$weekend.alcohol.consumption))
mean(mat_por_students$total_alcohol_consumption)
## [1] 3.778736
The average total alcohol consumption level of students is 3.78.
Covariance between the alcohol consumption and the failure of students
cov(mat_por_students$total_alcohol_consumption,mat_por_students$past.class.failures)
## [1] 0.1601812
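The covariance between the two variables is 0.16. Its positive sign indicates a positive relationship, and the relationship appears to be weak, although strength is better judged from the correlation coefficient, because covariance depends on the scale of the variables.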
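To make the link between covariance and correlation concrete: the correlation is the covariance divided by the product of the two standard deviations, which removes the dependence on scale. The sketch below uses simulated data only (the variables `x` and `y` are made up), so its numbers say nothing about the students.

```r
set.seed(1)

# Simulated data with a weak positive relationship (not the student data)
x <- rnorm(200)
y <- 0.2 * x + rnorm(200)

# Correlation is covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

# Rescaling x multiplies the covariance by 10 but leaves the correlation unchanged
cov_scaled <- cov(10 * x, y)
r_scaled <- cor(10 * x, y)
```

For the report's variables, the equivalent call would be `cor(mat_por_students$total_alcohol_consumption, mat_por_students$past.class.failures)`.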
# Convert the numeric failure count into a labelled factor
failures = factor(mat_por_students$past.class.failures,labels=c('Never Failed','Failed Once','Failed Twice','Failed Thrice'))
boxplot(total_alcohol_consumption ~ failures,data=mat_por_students)
##### Does alcohol consumption have any impact on the grades by students?
cor(mat_por_students$final.grade, mat_por_students$weekday.alcohol.consumption)
## [1] -0.1296421
cor(mat_por_students$final.grade, mat_por_students$weekend.alcohol.consumption)
## [1] -0.11574
The correlations between alcohol consumption and final grades are negative (about -0.13 for weekdays and -0.12 for weekends), which suggests that higher alcohol consumption is associated with lower student performance. However, the relationship is not very strong.
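The consumption variables take only the integer values 1 to 5, so there are many ties, and a rank-based correlation with a significance test could be reported alongside the Pearson coefficient. The following is a hedged sketch on simulated data; `alcohol_level` and `grade` are made-up stand-ins, and the simulated effect size is an arbitrary assumption.

```r
set.seed(42)

# Simulated stand-ins: an integer 1-5 consumption value and a 0-20 grade
alcohol_level <- sample(1:5, 300, replace = TRUE,
                        prob = c(0.70, 0.19, 0.07, 0.02, 0.02))
grade <- round(pmin(20, pmax(0, 12 - 0.5 * alcohol_level + rnorm(300, sd = 3))))

# Pearson correlation, as used in the report
pearson <- cor(alcohol_level, grade)

# Spearman rank correlation with a test; exact = FALSE because of the ties
spearman_test <- cor.test(alcohol_level, grade, method = "spearman", exact = FALSE)
spearman_test$estimate
spearman_test$p.value
```

A small p-value together with a small coefficient would indicate a relationship that is detectable but weak.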
mat_por_students%>%
group_by(gender)%>%
ggplot(aes(x=factor(health.status), y=school.absences, color=gender))+
geom_smooth(aes(group=gender), method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
It is observed that female students have lower attendance on average and that, as expected, absences decrease as the health scale increases for both male and female students.
mean(mat_por_students$total_alcohol_consumption)
## [1] 3.778736
The mean total alcohol consumption level of students is 3.78.
weekday_alcohol_consumption_vs_famrel = ggplot(mat_por_students, aes(x=weekday.alcohol.consumption)) +
geom_density(aes(color=as.factor(quality.of.family.relationships))) +
ggtitle("Distribution of Students' Weekday Alcohol Consumption Level by Quality of Family Relationship")
weekend_alcohol_consumption_vs_famrel = ggplot(mat_por_students, aes(x=weekend.alcohol.consumption)) +
geom_density(aes(color=as.factor(quality.of.family.relationships))) +
ggtitle("Distribution of Students' Weekend Alcohol Consumption Level by Quality of Family Relationship")
grid.arrange(weekday_alcohol_consumption_vs_famrel, weekend_alcohol_consumption_vs_famrel)
It can be observed that students who do not have good family relationships consume more alcohol on both weekdays and weekends.
weekday_alcohol_consumption_vs_address<-mat_por_students %>%
group_by(address)%>%
ggplot(aes(x=factor(weekday.alcohol.consumption), y= final.grade,color=factor(weekday.alcohol.consumption)))+
geom_jitter(alpha=0.6)+
scale_x_discrete("Weekday Alcohol Consumption")+
scale_y_continuous("Grade")+
facet_grid(~address)+ ggtitle("Consumption of Alcohol based on Geographical Location")
weekend_alcohol_consumption_vs_address<-mat_por_students %>%
group_by(address)%>%
ggplot(aes(x=factor(weekend.alcohol.consumption), y= final.grade,color=factor(weekend.alcohol.consumption)))+
geom_jitter(alpha=0.6)+
scale_x_discrete("Weekend Alcohol Consumption")+
scale_y_continuous("Grade")+
facet_grid(~address)
grid.arrange(weekday_alcohol_consumption_vs_address,weekend_alcohol_consumption_vs_address)
We can observe from the above that students living in urban areas have a higher alcohol consumption rate than students in rural areas; also, students' performance gradually decreases as their alcohol consumption increases.
# age is already stored as an integer; as.numeric() keeps it numeric for hist()
age = mat_por_students$age
ageI = as.numeric(age)
# Both histograms count students per age; the y-axis is a number of students, not a consumption level
hist(ageI, main = "Age of Students (Weekday Alcohol Analysis)", xlab = "age", ylab="number of students",border="brown", col="red")
hist_weekend_alchol_consumption_vs_age = hist(ageI, main = "Age of Students (Weekend Alcohol Analysis)", xlab = "age", ylab="number of students",border="brown", col="pink")
Across the age range of 15-22 years, most students are between 15 and 18, so most of the alcohol consumption reported on weekdays and weekends comes from that age group. These histograms count students per age; they do not compare consumption levels between ages.
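To compare consumption between ages rather than the number of students at each age, one option is to average the consumption values within each age. The sketch below does this on a tiny made-up data frame (`toy` and its values are assumptions); only the column names mirror the report's data.

```r
# Made-up rows; only the column names follow the report's data frame
toy <- data.frame(
  age = c(15, 15, 16, 16, 17, 17, 18, 18),
  weekday.alcohol.consumption = c(1, 1, 1, 2, 2, 1, 3, 2),
  weekend.alcohol.consumption = c(1, 2, 2, 3, 3, 2, 4, 3)
)

# Mean weekday and weekend consumption for each age
mean_by_age <- aggregate(
  cbind(weekday.alcohol.consumption, weekend.alcohol.consumption) ~ age,
  data = toy, FUN = mean
)
mean_by_age
```

Replacing `toy` with `mat_por_students` would give one row per age from 15 to 22; ages with very few students (for example 20 to 22) would need to be read with caution.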
ggplot(mat_por_students, aes(x=total_alcohol_consumption, y=past.class.failures, color=gender))+
geom_jitter(alpha=0.9)+ theme_bw()+ xlab("Total alcohol consumption")+
ylab("Number of past class failures")+
ggtitle("Total alcohol consumption vs past failure rate")
Past class failures do not seem to affect the alcohol consumption rate of students.
# Parents' combined education level (mother's plus father's) w.r.t. student alcohol consumption rate
mat_por_students$parents_education = rowSums(cbind(mat_por_students$mothers.education,mat_por_students$fathers.education))
ggplot(mat_por_students, aes(y = as.numeric(total_alcohol_consumption) , x= parents_education)) + geom_col()
We cannot find a clear pattern of association between alcohol consumption and parents' education.
mat_por_students$weekday.alcohol.consumption.factor <- as.factor(mat_por_students$weekday.alcohol.consumption)
mat_por_students$weekend.alcohol.consumption.factor <- as.factor(mat_por_students$weekend.alcohol.consumption)
weekday_alcohol_consumption<-mat_por_students %>%
ggplot(aes(x=weekday.alcohol.consumption.factor, y=final.grade, fill= weekday.alcohol.consumption.factor))+
geom_boxplot()+
coord_flip()+
xlab("Work Day Alcohol consumption")+
ylab("Grade")+
facet_grid(~gender)
weekend_alcohol_consumption<-mat_por_students %>%
ggplot(aes(x=weekend.alcohol.consumption.factor, y=final.grade, fill= weekend.alcohol.consumption.factor))+
geom_boxplot()+
coord_flip()+
xlab("Week End Alcohol consumption")+
ylab("Grade")+
facet_grid(~gender)
grid.arrange(weekday_alcohol_consumption,weekend_alcohol_consumption)
ggplot(mat_por_students,aes(x=weekday.alcohol.consumption,y=final.grade)) +
geom_point() + geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
ggplot(mat_por_students,aes(x=weekend.alcohol.consumption,y=final.grade)) +
geom_point() + geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
The plots illustrate that students who consume more alcohol tend to perform worse in examinations.
boxplot1 <- ggplot(mat_por_students, aes(x=weekday.alcohol.consumption, y=first.period.grade, fill=weekday.alcohol.consumption))+
geom_boxplot(aes(group = weekday.alcohol.consumption))+
theme_test()+
xlab("Weekday Alcohol consumption")+
ylab("First period grade")+
ggtitle("First period grade") + theme(legend.position = "none")
boxplot2 <- ggplot(mat_por_students, aes(x=weekday.alcohol.consumption, y=second.period.grade, fill=weekday.alcohol.consumption))+
geom_boxplot(aes(group = weekday.alcohol.consumption))+
theme_dark()+
xlab("Weekday Alcohol consumption")+
ylab("Second period grade")+
ggtitle("Second period grade") + theme(legend.position = "none")
boxplot3 <- ggplot(mat_por_students, aes(x=weekday.alcohol.consumption, y=final.grade, fill=weekday.alcohol.consumption))+
geom_boxplot(aes(group = weekday.alcohol.consumption))+
theme_linedraw()+
xlab("Weekday Alcohol consumption")+
ylab("Final period grade")+
ggtitle("Final period grade") + theme(legend.position = "none")
grid.arrange(boxplot1, boxplot2, boxplot3, ncol = 3,top=textGrob("Weekday Alcohol consumption level VS Grades ",gp=gpar(fontsize=10,font=4))
)
It can be observed that as weekday alcohol consumption increases, student performance decreases.
boxplot1 <- ggplot(mat_por_students, aes(x=weekend.alcohol.consumption, y=first.period.grade, fill=weekend.alcohol.consumption))+
geom_boxplot(aes(group = weekend.alcohol.consumption))+
theme_test()+
xlab("Weekend Alcohol consumption")+
ylab("First period grade")+
ggtitle("First period grade") + theme(legend.position = "none")
boxplot2 <- ggplot(mat_por_students, aes(x=weekend.alcohol.consumption, y=second.period.grade, fill=weekend.alcohol.consumption))+
geom_boxplot(aes(group = weekend.alcohol.consumption))+
theme_dark()+
xlab("Weekend Alcohol consumption")+
ylab("Second period grade")+
ggtitle("Second period grade") + theme(legend.position = "none")
boxplot3 <- ggplot(mat_por_students, aes(x=weekend.alcohol.consumption, y=final.grade, fill=weekend.alcohol.consumption))+
geom_boxplot(aes(group = weekend.alcohol.consumption))+
theme_linedraw()+
xlab("Weekend Alcohol consumption")+
ylab("Final period grade")+
ggtitle("Final period grade") + theme(legend.position = "none")
grid.arrange(boxplot1, boxplot2, boxplot3, ncol = 3,top=textGrob("Weekend Alcohol consumption level VS Grades ",gp=gpar(fontsize=10,font=4))
)
It can be observed that as weekend alcohol consumption increases, student performance decreases.
weekly_alc_freetime = ggplot(mat_por_students,aes(x=weekday.alcohol.consumption,y=freetime)) +
geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekday alcohol consumption level vs. free time")
weekend_alc_freetime = ggplot(mat_por_students,aes(x=weekend.alcohol.consumption,y=freetime)) +
geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekend alcohol consumption level vs. free time")
grid.arrange(weekly_alc_freetime, weekend_alc_freetime , ncol = 2,top=textGrob("Alcohol consumption level VS freetime ",gp=gpar(fontsize=10,font=4))
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
weekly_alc_going_out = ggplot(mat_por_students,aes(x=weekday.alcohol.consumption,y=going.out.with.friends)) +
geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekday alcohol consumption level vs. going out with friends")
weekend_alc_going_out = ggplot(mat_por_students,aes(x=weekend.alcohol.consumption,y=going.out.with.friends)) +
geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekend alcohol consumption level vs. going out with friends")
grid.arrange(weekly_alc_going_out, weekend_alc_going_out , ncol = 2,top=textGrob("Alcohol consumption level VS going out with friends ",gp=gpar(fontsize=10,font=4))
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
It can be observed that going out with friends more often is associated with a higher alcohol consumption level on both weekends and weekdays.
weekday_alc_study_time = ggplot(mat_por_students,aes(x=weekday.alcohol.consumption,y=studytime)) +
geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekday alcohol consumption level vs. study time")
weekend_alc_study_time = ggplot(mat_por_students,aes(x=weekend.alcohol.consumption,y=studytime)) +
geom_point() + geom_smooth(method = 'lm') + ggtitle("Weekend alcohol consumption level vs. study time")
grid.arrange(weekday_alc_study_time, weekend_alc_study_time , ncol = 2,top=textGrob("Alcohol consumption level VS studytime ",gp=gpar(fontsize=10,font=4))
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
It can be observed that alcohol consumption and study time are negatively correlated, i.e., as students' study time decreases, alcohol consumption increases.
From all of the factors above, we found that study time, free time, going out with friends, address, family relations, and student grades appear to be related to the alcohol consumption habits of students. Let us build a linear regression model to describe these relationships.
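For the variables listed here, all of the p-values are less than 0.05, so we reject the null hypothesis that the coefficient is zero for each of them; none of these variables is entirely irrelevant to alcohol intake on weekdays. The p-values are: intercept 5.82e-05, schoolMS 5.07e-11, genderM < 2e-16, age 4.69e-07, addressU 0.015, studytime 0.022, extra-curricular activities 0.013, freetime 0.017, going out with friends < 2e-16, and weekend alcohol consumption < 2e-16. These are significance levels, not coefficient estimates, so they cannot be combined into a regression equation. In the weekend model fitted below, the response is regressed on every column, including total_alcohol_consumption (weekday plus weekend consumption). The fit is therefore exact: the coefficients are -1 for weekday.alcohol.consumption and 1 for total_alcohol_consumption, the residuals are of order 1e-14, and all other estimates are of order 1e-14 or smaller, so its p-values say nothing reliable about the remaining variables.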
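By this point the data frame also contains columns derived from the response (total_alcohol_consumption and the factor copies of the consumption variables), so a formula of the form `response ~ .` lets the model reconstruct the response exactly. The sketch below illustrates the effect on simulated data and one way to exclude a derived column; the data frame `toy` and its short column names are assumptions for illustration, not the report's model.

```r
set.seed(7)
n <- 100

# Simulated, mutually independent 1-5 values (not the student data)
toy <- data.frame(
  weekday = sample(1:5, n, replace = TRUE),
  weekend = sample(1:5, n, replace = TRUE),
  goout   = sample(1:5, n, replace = TRUE)
)

# Derived column, analogous to total_alcohol_consumption
toy$total <- toy$weekday + toy$weekend

# With the derived column included, weekend = total - weekday holds exactly
leaky_fit <- lm(weekend ~ ., data = toy)
round(coef(leaky_fit), 6)

# Excluding the derived column gives a model that can actually be interpreted
clean_fit <- lm(weekend ~ . - total, data = toy)
summary(clean_fit)$r.squared
```

In the leaky fit the coefficients of `total` and `weekday` come out as 1 and -1 with residuals near machine precision, which is the same signature as a fit in which the response is an exact linear combination of its predictors.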
weekend_alcohol_consumption_survey = lm(weekend.alcohol.consumption~., data= mat_por_students)
coef(weekend_alcohol_consumption_survey)
## (Intercept) schoolMS
## 1.859591e-14 -4.433036e-15
## genderM age
## -1.260509e-14 -8.947536e-16
## addressU family.sizeLE3
## 1.039537e-15 2.772429e-16
## parents.cohabitation.statusT mothers.education
## -4.273212e-16 -4.749574e-17
## fathers.education mothers.jobhealth
## 4.852776e-18 3.403729e-17
## mothers.jobother mothers.jobservices
## -2.955321e-16 -1.033102e-15
## mothers.jobteacher fathers.jobhealth
## -9.868835e-16 6.623150e-16
## fathers.jobother fathers.jobservices
## 1.095695e-15 5.509421e-16
## fathers.jobteacher reason.for.school.selectionhome
## 5.047604e-16 -4.735944e-16
## reason.for.school.selectionother reason.for.school.selectionreputation
## 1.121461e-15 -5.552352e-16
## guardianmother guardianother
## 2.226133e-16 -4.449912e-16
## traveltime studytime
## -3.636774e-16 1.841462e-16
## past.class.failures extra.school.supportyes
## 1.047072e-15 2.234714e-16
## family.education.supportyes extra.paid.classesyes
## -1.664347e-16 1.058920e-15
## extra.curricular.activitiesyes attended.nursery.schoolyes
## 9.744521e-16 4.888939e-16
## higher.educationyes internet.accessyes
## 2.472096e-16 4.860300e-16
## romantic.relationshipyes quality.of.family.relationships
## -5.971048e-16 5.183392e-16
## freetime going.out.with.friends
## 4.724658e-16 -2.266561e-15
## weekday.alcohol.consumption health.status
## -1.000000e+00 -7.019046e-17
## school.absences first.period.grade
## 1.081548e-17 -3.012516e-19
## second.period.grade final.grade
## -2.105156e-16 1.942042e-16
## total_alcohol_consumption parents_education
## 1.000000e+00 NA
## weekday.alcohol.consumption.factor2 weekday.alcohol.consumption.factor3
## 8.311543e-16 1.594434e-17
## weekday.alcohol.consumption.factor4 weekday.alcohol.consumption.factor5
## -2.954519e-16 NA
## weekend.alcohol.consumption.factor2 weekend.alcohol.consumption.factor3
## 2.552545e-16 6.004324e-16
## weekend.alcohol.consumption.factor4 weekend.alcohol.consumption.factor5
## -5.463281e-18 NA
summary(weekend_alcohol_consumption_survey)
##
## Call:
## lm(formula = weekend.alcohol.consumption ~ ., data = mat_por_students)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.472e-14 -1.130e-15 -1.800e-17 9.880e-16 1.541e-13
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.860e-14 3.766e-15 4.938e+00 9.25e-07
## schoolMS -4.433e-15 5.317e-16 -8.338e+00 2.50e-16
## genderM -1.261e-14 4.540e-16 -2.777e+01 < 2e-16
## age -8.948e-16 1.866e-16 -4.796e+00 1.87e-06
## addressU 1.040e-15 4.961e-16 2.095e+00 0.03639
## family.sizeLE3 2.772e-16 4.477e-16 6.190e-01 0.53588
## parents.cohabitation.statusT -4.273e-16 6.455e-16 -6.620e-01 0.50814
## mothers.education -4.750e-17 2.814e-16 -1.690e-01 0.86602
## fathers.education 4.853e-18 2.505e-16 1.900e-02 0.98455
## mothers.jobhealth 3.404e-17 9.907e-16 3.400e-02 0.97260
## mothers.jobother -2.955e-16 5.848e-16 -5.050e-01 0.61340
## mothers.jobservices -1.033e-15 6.935e-16 -1.490e+00 0.13663
## mothers.jobteacher -9.869e-16 9.174e-16 -1.076e+00 0.28229
## fathers.jobhealth 6.623e-16 1.341e-15 4.940e-01 0.62140
## fathers.jobother 1.096e-15 8.652e-16 1.266e+00 0.20564
## fathers.jobservices 5.509e-16 9.052e-16 6.090e-01 0.54288
## fathers.jobteacher 5.048e-16 1.209e-15 4.180e-01 0.67637
## reason.for.school.selectionhome -4.736e-16 5.121e-16 -9.250e-01 0.35528
## reason.for.school.selectionother 1.121e-15 6.948e-16 1.614e+00 0.10681
## reason.for.school.selectionreputation -5.552e-16 5.369e-16 -1.034e+00 0.30130
## guardianmother 2.226e-16 4.902e-16 4.540e-01 0.64981
## guardianother -4.450e-16 9.404e-16 -4.730e-01 0.63619
## traveltime -3.637e-16 2.975e-16 -1.222e+00 0.22189
## studytime 1.841e-16 2.599e-16 7.090e-01 0.47876
## past.class.failures 1.047e-15 3.486e-16 3.004e+00 0.00273
## extra.school.supportyes 2.235e-16 6.577e-16 3.400e-01 0.73408
## family.education.supportyes -1.664e-16 4.247e-16 -3.920e-01 0.69524
## extra.paid.classesyes 1.059e-15 4.996e-16 2.120e+00 0.03428
## extra.curricular.activitiesyes 9.745e-16 4.071e-16 2.394e+00 0.01686
## attended.nursery.schoolyes 4.889e-16 4.973e-16 9.830e-01 0.32582
## higher.educationyes 2.472e-16 7.724e-16 3.200e-01 0.74900
## internet.accessyes 4.860e-16 5.225e-16 9.300e-01 0.35249
## romantic.relationshipyes -5.971e-16 4.248e-16 -1.406e+00 0.16013
## quality.of.family.relationships 5.183e-16 2.172e-16 2.386e+00 0.01721
## freetime 4.725e-16 2.091e-16 2.260e+00 0.02406
## going.out.with.friends -2.267e-15 1.999e-16 -1.134e+01 < 2e-16
## weekday.alcohol.consumption -1.000e+00 6.045e-16 -1.654e+15 < 2e-16
## health.status -7.019e-17 1.431e-16 -4.900e-01 0.62395
## school.absences 1.082e-17 3.365e-17 3.210e-01 0.74798
## first.period.grade -3.013e-19 1.326e-16 -2.000e-03 0.99819
## second.period.grade -2.105e-16 1.668e-16 -1.262e+00 0.20721
## final.grade 1.942e-16 1.243e-16 1.562e+00 0.11864
## total_alcohol_consumption 1.000e+00 2.893e-16 3.456e+15 < 2e-16
## parents_education NA NA NA NA
## weekday.alcohol.consumption.factor2 8.312e-16 6.485e-16 1.282e+00 0.20028
## weekday.alcohol.consumption.factor3 1.594e-17 1.054e-15 1.500e-02 0.98794
## weekday.alcohol.consumption.factor4 -2.955e-16 1.582e-15 -1.870e-01 0.85190
## weekday.alcohol.consumption.factor5 NA NA NA NA
## weekend.alcohol.consumption.factor2 2.553e-16 5.435e-16 4.700e-01 0.63873
## weekend.alcohol.consumption.factor3 6.004e-16 6.865e-16 8.750e-01 0.38198
## weekend.alcohol.consumption.factor4 -5.463e-18 8.757e-16 -6.000e-03 0.99502
## weekend.alcohol.consumption.factor5 NA NA NA NA
##
## (Intercept) ***
## schoolMS ***
## genderM ***
## age ***
## addressU *
## family.sizeLE3
## parents.cohabitation.statusT
## mothers.education
## fathers.education
## mothers.jobhealth
## mothers.jobother
## mothers.jobservices
## mothers.jobteacher
## fathers.jobhealth
## fathers.jobother
## fathers.jobservices
## fathers.jobteacher
## reason.for.school.selectionhome
## reason.for.school.selectionother
## reason.for.school.selectionreputation
## guardianmother
## guardianother
## traveltime
## studytime
## past.class.failures **
## extra.school.supportyes
## family.education.supportyes
## extra.paid.classesyes *
## extra.curricular.activitiesyes *
## attended.nursery.schoolyes
## higher.educationyes
## internet.accessyes
## romantic.relationshipyes
## quality.of.family.relationships *
## freetime *
## going.out.with.friends ***
## weekday.alcohol.consumption ***
## health.status
## school.absences
## first.period.grade
## second.period.grade
## final.grade
## total_alcohol_consumption ***
## parents_education
## weekday.alcohol.consumption.factor2
## weekday.alcohol.consumption.factor3
## weekday.alcohol.consumption.factor4
## weekday.alcohol.consumption.factor5
## weekend.alcohol.consumption.factor2
## weekend.alcohol.consumption.factor3
## weekend.alcohol.consumption.factor4
## weekend.alcohol.consumption.factor5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.16e-15 on 995 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 9.457e+29 on 48 and 995 DF, p-value: < 2.2e-16
Weekday alcohol intake is best explained by how often students go out. Holding all other factors constant, a one-unit rise in going out is associated with a 0.175 increase in weekday alcohol intake. With the other variables held constant, weekday alcohol intake is positively associated with frequency of going out, amount of free time, and health status, and negatively associated with the quality of family relationships. In other words, students who went out more, had more free time, or reported better health tended to drink more, while students with better family relationships tended to drink less.
For weekend alcohol consumption, family relationships, study time, and going out with friends appear to be the more significant variables.
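One caution about the full weekend model summarized above: its R-squared of 1 and residuals on the order of 1e-15 arise because total_alcohol_consumption is the sum of the weekday and weekend scores, so the response can be reconstructed exactly from two predictors (their coefficients are -1 and 1). The sketch below reproduces this effect on a small simulated data frame. The column names (`weekday`, `weekend`, `goout`, `total`) and all values are hypothetical stand-ins, not the survey data.

```r
# Simulated stand-in for the survey data; all names and values are hypothetical.
set.seed(1)
n <- 200
toy <- data.frame(
  weekday = sample(1:5, n, replace = TRUE),
  goout   = sample(1:5, n, replace = TRUE)
)
# Weekend score loosely related to going out, clipped to the 1-5 scale.
toy$weekend <- pmin(5, pmax(1, round(0.5 * toy$goout + rnorm(n, mean = 1))))
# Derived column: this is what turns the full model into an identity.
toy$total <- toy$weekday + toy$weekend

# Model with the derived column: weekend = total - weekday exactly.
leaky_fit <- lm(weekend ~ ., data = toy)
round(coef(leaky_fit), 6)   # weekday is -1, total is 1, the rest are 0

# Model without the derived column: coefficients become interpretable.
clean_fit <- lm(weekend ~ goout + weekday, data = toy)
summary(clean_fit)$r.squared
```

Under these assumptions, excluding the derived total (and the duplicated `.factor` columns) before fitting is one way to obtain coefficients that describe the other predictors rather than the arithmetic identity.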
ggplot(mat_por_students, aes(x = gender, y = past.class.failures, fill = gender)) +
  geom_bar(stat = "identity") +
  ggtitle("Total number of past class failures by gender")
table(mat_por_students$past.class.failures,mat_por_students$gender)
##
## F M
## 0 497 364
## 1 65 55
## 2 18 15
## 3 11 19
table(mat_por_students$age,mat_por_students$past.class.failures)
##
## 0 1 2 3
## 15 179 7 5 3
## 16 249 22 7 3
## 17 237 27 3 10
## 18 175 35 7 5
## 19 18 26 6 6
## 20 3 3 3 0
## 21 0 0 2 1
## 22 0 0 0 2
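# Use the data-frame column explicitly so the plots do not depend on a separate `failures` variable
ggplot(aes(x = past.class.failures), data = mat_por_students) +
  geom_bar(aes(fill = gender)) +
  ggtitle("Past failures of students")
plot(x = mat_por_students$past.class.failures, y = mat_por_students$final.grade, xlab = "Failures", ylab = "Final grade", col = "green")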
cor(mat_por_students$final.grade, mat_por_students$past.class.failures)
## [1] -0.3831453
ggplot(data = mat_por_students,aes(x = extra.school.support,y=final.grade,fill=extra.school.support))+
geom_boxplot(show.legend = F) + labs(x="Extra Educational Support",y="Final Score")+ ggtitle("Extra Educational Support and Final Grade")
Surprisingly, the plot above shows that students who received extra school support scored lower than students who did not.
ggplot(data = mat_por_students, aes(x = address, y = final.grade, fill = address)) +
  geom_boxplot(show.legend = FALSE) +
  labs(x = "Address (U: Urban, R: Rural)", y = "Final Score") +
  scale_fill_manual(values = c("#999999", "#E69F00")) +
  ggtitle("Rural/Urban Living Area in relation to Final Grades")
In comparison to rural students, urban students appear to perform better.
mat_por_students%>%
group_by(internet.access)%>%
ggplot(aes(x=final.grade, fill=internet.access))+
geom_density(alpha=0.5)+ ggtitle("Internet Access and Final Grade")
# Mean final grade for each weekday alcohol consumption level
mat_por_students %>%
  aggregate(final.grade ~ weekday.alcohol.consumption, data = ., mean) %>%
  arrange(desc(final.grade))
# Mean final grade for each weekend alcohol consumption level
mat_por_students %>%
  aggregate(final.grade ~ weekend.alcohol.consumption, data = ., mean) %>%
  arrange(desc(final.grade))
ggplot(data = mat_por_students, aes(x = final.grade)) +
  geom_bar(aes(fill = romantic.relationship), alpha = .9) +
  labs(y = "Count", x = "Final Grade")
Fewer students are in a romantic relationship than not. The distribution of final grades is very similar for both groups in the plot above, but the highest final grades appear to belong to students who are not in a romantic relationship.
mat_por_students$going.out.with.friends.factor <- as.factor(mat_por_students$going.out.with.friends)
mat_por_students %>%
  group_by(going.out.with.friends.factor) %>%
  summarise(AverageScore = mean(final.grade)) %>%
  arrange(desc(going.out.with.friends.factor))
Going out with friends appears to affect students' average score: once the going-out level rises above 3, the average score decreases.
ggplot(data = mat_por_students, aes(x = final.grade)) +
  geom_density(aes(fill = higher.education)) +
  labs(y = "Density", x = "Final Score") +
  scale_color_grey() +
  theme_classic() +
  ggtitle("Student Performance and the Desire for Higher Education")
Students who want to pursue higher education perform better in their examinations than the other students.
plot(x=mat_por_students$studytime, y=mat_por_students$final.grade, xlab = "Study time", ylab = "Final grade", col = "brown")
cor(mat_por_students$studytime,mat_por_students$final.grade)
## [1] 0.1616289
The plot and the correlation (about 0.16) suggest that study time has a positive but weak association with final grade.
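For a single predictor, the squared correlation equals the share of variance in final grade that a straight-line fit on that predictor would account for. A quick check using the two correlations reported above (the variable names are illustrative only):

```r
# Correlations copied from the output above.
r_studytime <- 0.1616289    # studytime vs final.grade
r_failures  <- -0.3831453   # past.class.failures vs final.grade

# Squared correlation = R-squared of a one-predictor linear regression.
r_squared <- c(studytime = r_studytime^2, failures = r_failures^2)
round(r_squared, 3)
```

On its own, study time accounts for under 3% of the variance in final grade, while past class failures account for roughly 15%.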
# Pie chart of mothers' jobs: slice size is the number of students in each category
ggplot(mat_por_students, aes(x = "", fill = mothers.job)) +
  geom_bar(width = 1) +
  coord_polar("y", start = 0)
ggplot(data= mat_por_students, aes(x=mothers.job, y=final.grade,fill=mothers.job))+
geom_boxplot(show.legend = F)
ggplot(mat_por_students, aes(x=traveltime, y=final.grade)) +
geom_point()+ geom_density_2d()+
geom_smooth(method=lm)+ ggtitle("Travel Time to School and Final Grade")
## `geom_smooth()` using formula 'y ~ x'
mat_por_students%>% group_by(extra.paid.classes)%>% aggregate(final.grade~extra.paid.classes, data=., mean)%>%
arrange(desc(final.grade))
Surprisingly, students who have not taken extra paid classes perform better on average than those who have.
The data are split so that 80% of the rows form the training set and the remaining 20% form the test set.
set.seed (199)
trainindex=sample(nrow(mat_por_students),nrow(mat_por_students)*.8)
train_data <- mat_por_students[trainindex, ]
test_data <- mat_por_students[-trainindex, ]
str(train_data)
## 'data.frame': 835 obs. of 38 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 2 1 2 ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 1 1 2 2 1 2 2 1 ...
## $ age : int 16 17 18 17 15 18 18 17 18 19 ...
## $ address : Factor w/ 2 levels "R","U": 2 1 2 1 2 2 2 2 2 1 ...
## $ family.size : Factor w/ 2 levels "GT3","LE3": 1 1 2 2 1 1 1 2 1 1 ...
## $ parents.cohabitation.status : Factor w/ 2 levels "A","T": 2 2 1 2 2 2 2 2 2 2 ...
## $ mothers.education : int 4 2 4 3 4 2 4 3 4 2 ...
## $ fathers.education : int 4 2 4 1 3 1 4 1 2 3 ...
## $ mothers.job : Factor w/ 5 levels "at_home","health",..: 5 3 2 4 5 4 2 4 5 4 ...
## $ fathers.job : Factor w/ 5 levels "at_home","health",..: 5 4 3 3 3 4 2 4 3 3 ...
## $ reason.for.school.selection : Factor w/ 4 levels "course","home",..: 2 3 2 4 1 3 4 1 2 1 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 2 2 2 2 2 1 2 2 2 ...
## $ traveltime : int 1 2 1 2 2 1 1 2 1 1 ...
## $ studytime : int 2 1 2 4 2 1 2 1 2 3 ...
## $ past.class.failures : int 0 0 0 0 0 1 1 0 0 1 ...
## $ extra.school.support : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ family.education.support : Factor w/ 2 levels "no","yes": 2 1 2 2 2 1 2 1 2 1 ...
## $ extra.paid.classes : Factor w/ 2 levels "no","yes": 2 1 2 2 2 1 1 1 2 1 ...
## $ extra.curricular.activities : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 1 2 2 ...
## $ attended.nursery.school : Factor w/ 2 levels "no","yes": 2 1 2 2 2 1 2 1 2 1 ...
## $ higher.education : Factor w/ 2 levels "no","yes": 2 1 2 2 2 1 2 2 2 2 ...
## $ internet.access : Factor w/ 2 levels "no","yes": 2 1 2 1 2 2 2 2 2 2 ...
## $ romantic.relationship : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 2 1 2 1 ...
## $ quality.of.family.relationships : int 4 5 4 3 5 3 2 2 4 5 ...
## $ freetime : int 4 2 2 1 4 2 4 4 3 4 ...
## $ going.out.with.friends : int 5 2 4 2 3 5 4 5 2 2 ...
## $ weekday.alcohol.consumption : int 5 1 1 1 1 2 1 3 1 1 ...
## $ weekend.alcohol.consumption : int 5 1 1 1 2 5 1 4 4 2 ...
## $ health.status : int 5 4 4 3 3 5 4 2 5 5 ...
## $ school.absences : int 16 0 0 6 2 4 2 6 11 0 ...
## $ first.period.grade : int 10 9 14 18 10 6 14 10 12 7 ...
## $ second.period.grade : int 12 10 15 18 10 9 12 10 11 5 ...
## $ final.grade : int 11 10 15 18 11 8 13 10 11 0 ...
## $ total_alcohol_consumption : num 10 2 2 2 3 7 2 7 5 3 ...
## $ parents_education : num 8 4 8 4 7 3 8 4 6 5 ...
## $ weekday.alcohol.consumption.factor: Factor w/ 5 levels "1","2","3","4",..: 5 1 1 1 1 2 1 3 1 1 ...
## $ weekend.alcohol.consumption.factor: Factor w/ 5 levels "1","2","3","4",..: 5 1 1 1 2 5 1 4 4 2 ...
## $ going.out.with.friends.factor : Factor w/ 5 levels "1","2","3","4",..: 5 2 4 2 3 5 4 5 2 2 ...
str(test_data)
## 'data.frame': 209 obs. of 38 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 2 1 2 1 1 ...
## $ age : int 18 16 16 16 16 15 16 16 16 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ family.size : Factor w/ 2 levels "GT3","LE3": 1 2 1 2 2 1 2 1 1 1 ...
## $ parents.cohabitation.status : Factor w/ 2 levels "A","T": 1 2 2 2 1 2 1 2 1 1 ...
## $ mothers.education : int 4 2 4 2 3 4 3 4 2 4 ...
## $ fathers.education : int 4 2 4 2 4 4 3 3 1 3 ...
## $ mothers.job : Factor w/ 5 levels "at_home","health",..: 1 3 4 3 4 4 3 2 3 4 ...
## $ fathers.job : Factor w/ 5 levels "at_home","health",..: 5 3 4 3 3 4 4 4 3 4 ...
## $ reason.for.school.selection : Factor w/ 4 levels "course","home",..: 1 2 4 4 2 4 2 4 3 4 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ traveltime : int 2 1 1 2 1 2 1 1 1 1 ...
## $ studytime : int 2 2 3 2 2 2 2 4 2 2 ...
## $ past.class.failures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ extra.school.support : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ family.education.support : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 1 1 2 ...
## $ extra.paid.classes : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 2 2 ...
## $ extra.curricular.activities : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 2 2 2 ...
## $ attended.nursery.school : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ higher.education : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet.access : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ romantic.relationship : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ quality.of.family.relationships : int 4 4 3 5 5 4 2 4 5 4 ...
## $ freetime : int 3 4 2 4 3 3 3 2 3 3 ...
## $ going.out.with.friends : int 4 4 3 4 3 1 5 2 4 2 ...
## $ weekday.alcohol.consumption : int 1 1 1 2 1 1 1 1 1 1 ...
## $ weekend.alcohol.consumption : int 1 1 2 4 1 1 4 1 1 1 ...
## $ health.status : int 3 3 2 5 5 5 3 2 2 1 ...
## $ school.absences : int 6 0 6 0 4 0 12 4 8 0 ...
## $ first.period.grade : int 5 12 13 13 11 17 11 19 8 14 ...
## $ second.period.grade : int 6 12 14 13 11 16 12 19 9 15 ...
## $ final.grade : int 6 11 14 12 11 17 11 20 10 15 ...
## $ total_alcohol_consumption : num 2 2 3 6 2 2 5 2 2 2 ...
## $ parents_education : num 8 4 8 4 7 8 6 7 3 7 ...
## $ weekday.alcohol.consumption.factor: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 2 1 1 1 1 1 1 ...
## $ weekend.alcohol.consumption.factor: Factor w/ 5 levels "1","2","3","4",..: 1 1 2 4 1 1 4 1 1 1 ...
## $ going.out.with.friends.factor : Factor w/ 5 levels "1","2","3","4",..: 4 4 3 4 3 1 5 2 4 2 ...
student_grade_model_1 = lm(final.grade ~ first.period.grade + second.period.grade, data = train_data)
summary(student_grade_model_1)
##
## Call:
## lm(formula = final.grade ~ first.period.grade + second.period.grade,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9324 -0.3781 0.1006 0.8585 6.0676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.09930 0.21364 -5.146 3.33e-07 ***
## first.period.grade 0.13619 0.03572 3.812 0.000148 ***
## second.period.grade 0.96698 0.03231 29.930 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.587 on 832 degrees of freedom
## Multiple R-squared: 0.8338, Adjusted R-squared: 0.8334
## F-statistic: 2087 on 2 and 832 DF, p-value: < 2.2e-16
In the model above, the overall p-value (< 2.2e-16) is far below 0.05, which means that at least one of the independent variables is significantly associated with the dependent variable, the students' final grade. Both the first-period grade and the second-period grade are significant predictors of the final grade, and together they explain about 83% of its variance. We can infer that students who perform well in the earlier periods tend to achieve a higher final grade.
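To make the coefficients concrete, the fitted equation is approximately final.grade = -1.099 + 0.136 * first.period.grade + 0.967 * second.period.grade. The sketch below evaluates it for one hypothetical student; the grades of 12 and 13 are made up for illustration.

```r
# Coefficients copied from summary(student_grade_model_1) above.
b0 <- -1.09930   # intercept
b1 <-  0.13619   # first.period.grade
b2 <-  0.96698   # second.period.grade

# Hypothetical student: 12 in the first period, 13 in the second.
g1 <- 12
g2 <- 13

predicted_final <- b0 + b1 * g1 + b2 * g2
predicted_final   # about 13.1
```

With the fitted model object available, `predict(student_grade_model_1, newdata = data.frame(first.period.grade = 12, second.period.grade = 13))` should give the same value up to rounding of the printed coefficients. Because the second-period coefficient is close to 1, the prediction is dominated by the most recent grade.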
studytime_walc_model = lm(final.grade ~ studytime + total_alcohol_consumption , data = train_data)
summary(studytime_walc_model)
##
## Call:
## lm(formula = final.grade ~ studytime + total_alcohol_consumption,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9759 -1.5898 0.2649 2.4184 8.1032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.5274 0.4692 22.435 < 2e-16 ***
## studytime 0.6930 0.1605 4.317 1.77e-05 ***
## total_alcohol_consumption -0.1618 0.0686 -2.358 0.0186 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.823 on 832 degrees of freedom
## Multiple R-squared: 0.03495, Adjusted R-squared: 0.03263
## F-statistic: 15.07 on 2 and 832 DF, p-value: 3.739e-07
grade_model <- lm(final.grade ~ past.class.failures+studytime+higher.education+extra.school.support+internet.access+going.out.with.friends+romantic.relationship,data = train_data)
summary(grade_model)
##
## Call:
## lm(formula = final.grade ~ past.class.failures + studytime +
## higher.education + extra.school.support + internet.access +
## going.out.with.friends + romantic.relationship, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9535 -1.5050 0.3338 2.1313 7.6352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.63584 0.65069 14.809 < 2e-16 ***
## past.class.failures -1.98850 0.19194 -10.360 < 2e-16 ***
## studytime 0.47060 0.14798 3.180 0.00153 **
## higher.educationyes 1.45080 0.45190 3.210 0.00138 **
## extra.school.supportyes -1.25310 0.39132 -3.202 0.00142 **
## internet.accessyes 0.67877 0.30025 2.261 0.02404 *
## going.out.with.friends -0.07181 0.10738 -0.669 0.50386
## romantic.relationshipyes -0.62254 0.25592 -2.433 0.01520 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.489 on 827 degrees of freedom
## Multiple R-squared: 0.201, Adjusted R-squared: 0.1942
## F-statistic: 29.71 on 7 and 827 DF, p-value: < 2.2e-16
model =lm(final.grade~., data= train_data)
summary(model)
##
## Call:
## lm(formula = final.grade ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9156 -0.5030 0.1236 0.7887 5.6795
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4663462 1.1071484 -0.421 0.67371
## schoolMS 0.1852471 0.1517368 1.221 0.22251
## genderM -0.0120514 0.1312554 -0.092 0.92687
## age -0.0530898 0.0524889 -1.011 0.31211
## addressU 0.1663470 0.1404447 1.184 0.23660
## family.sizeLE3 0.0276404 0.1279574 0.216 0.82903
## parents.cohabitation.statusT -0.0898500 0.1844827 -0.487 0.62637
## mothers.education 0.0010577 0.0797069 0.013 0.98942
## fathers.education -0.0394428 0.0712009 -0.554 0.57976
## mothers.jobhealth 0.0499081 0.2839346 0.176 0.86052
## mothers.jobother -0.0223596 0.1713440 -0.130 0.89621
## mothers.jobservices 0.0743007 0.2027081 0.367 0.71406
## mothers.jobteacher 0.1035629 0.2681834 0.386 0.69948
## fathers.jobhealth -0.1215383 0.3787235 -0.321 0.74836
## fathers.jobother -0.2184594 0.2434501 -0.897 0.36981
## fathers.jobservices -0.3523352 0.2558579 -1.377 0.16888
## fathers.jobteacher -0.3884929 0.3502331 -1.109 0.26767
## reason.for.school.selectionhome -0.1458425 0.1472554 -0.990 0.32228
## reason.for.school.selectionother -0.2348689 0.1955910 -1.201 0.23018
## reason.for.school.selectionreputation -0.0564394 0.1542963 -0.366 0.71462
## guardianmother 0.1638691 0.1400655 1.170 0.24238
## guardianother 0.1643582 0.2670234 0.616 0.53839
## traveltime 0.0741328 0.0875349 0.847 0.39731
## studytime -0.0457891 0.0741487 -0.618 0.53706
## past.class.failures -0.2540944 0.0992845 -2.559 0.01068
## extra.school.supportyes 0.0847372 0.1929170 0.439 0.66061
## family.education.supportyes 0.2939769 0.1233044 2.384 0.01736
## extra.paid.classesyes -0.2774266 0.1429470 -1.941 0.05264
## extra.curricular.activitiesyes -0.2758952 0.1176846 -2.344 0.01931
## attended.nursery.schoolyes -0.1215181 0.1424381 -0.853 0.39385
## higher.educationyes -0.2254339 0.2185704 -1.031 0.30267
## internet.accessyes -0.0070013 0.1479869 -0.047 0.96228
## romantic.relationshipyes -0.0643190 0.1223258 -0.526 0.59918
## quality.of.family.relationships 0.0816132 0.0625098 1.306 0.19207
## freetime 0.0007795 0.0603866 0.013 0.98970
## going.out.with.friends 0.0718472 0.0701772 1.024 0.30625
## weekday.alcohol.consumption -0.1287653 0.1137774 -1.132 0.25809
## weekend.alcohol.consumption 0.1456309 0.0853507 1.706 0.08835
## health.status -0.0072131 0.0411645 -0.175 0.86095
## school.absences 0.0286119 0.0091740 3.119 0.00188
## first.period.grade 0.1544142 0.0375032 4.117 4.24e-05
## second.period.grade 0.9455785 0.0331597 28.516 < 2e-16
## total_alcohol_consumption NA NA NA NA
## parents_education NA NA NA NA
## weekday.alcohol.consumption.factor2 -0.2259466 0.1856214 -1.217 0.22388
## weekday.alcohol.consumption.factor3 0.1916335 0.3049241 0.628 0.52988
## weekday.alcohol.consumption.factor4 -0.1956194 0.4455736 -0.439 0.66076
## weekday.alcohol.consumption.factor5 NA NA NA NA
## weekend.alcohol.consumption.factor2 -0.2448370 0.1592861 -1.537 0.12467
## weekend.alcohol.consumption.factor3 -0.1745679 0.2006880 -0.870 0.38465
## weekend.alcohol.consumption.factor4 -0.0904112 0.2567847 -0.352 0.72487
## weekend.alcohol.consumption.factor5 NA NA NA NA
## going.out.with.friends.factor2 0.1850878 0.2071372 0.894 0.37184
## going.out.with.friends.factor3 0.3359318 0.1690766 1.987 0.04729
## going.out.with.friends.factor4 0.1250463 0.1782406 0.702 0.48316
## going.out.with.friends.factor5 NA NA NA NA
##
## (Intercept)
## schoolMS
## genderM
## age
## addressU
## family.sizeLE3
## parents.cohabitation.statusT
## mothers.education
## fathers.education
## mothers.jobhealth
## mothers.jobother
## mothers.jobservices
## mothers.jobteacher
## fathers.jobhealth
## fathers.jobother
## fathers.jobservices
## fathers.jobteacher
## reason.for.school.selectionhome
## reason.for.school.selectionother
## reason.for.school.selectionreputation
## guardianmother
## guardianother
## traveltime
## studytime
## past.class.failures *
## extra.school.supportyes
## family.education.supportyes *
## extra.paid.classesyes .
## extra.curricular.activitiesyes *
## attended.nursery.schoolyes
## higher.educationyes
## internet.accessyes
## romantic.relationshipyes
## quality.of.family.relationships
## freetime
## going.out.with.friends
## weekday.alcohol.consumption
## weekend.alcohol.consumption .
## health.status
## school.absences **
## first.period.grade ***
## second.period.grade ***
## total_alcohol_consumption
## parents_education
## weekday.alcohol.consumption.factor2
## weekday.alcohol.consumption.factor3
## weekday.alcohol.consumption.factor4
## weekday.alcohol.consumption.factor5
## weekend.alcohol.consumption.factor2
## weekend.alcohol.consumption.factor3
## weekend.alcohol.consumption.factor4
## weekend.alcohol.consumption.factor5
## going.out.with.friends.factor2
## going.out.with.friends.factor3 *
## going.out.with.friends.factor4
## going.out.with.friends.factor5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.571 on 784 degrees of freedom
## Multiple R-squared: 0.8465, Adjusted R-squared: 0.8367
## F-statistic: 86.46 on 50 and 784 DF, p-value: < 2.2e-16
test_data$final.grade.predicted <- predict(object = model,newdata = test_data)
## Warning in predict.lm(object = model, newdata = test_data): prediction from a
## rank-deficient fit may be misleading
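The warning arises because some columns are exact linear combinations of others: total_alcohol_consumption and parents_education are sums of columns already in the model, and each `.factor` column carries the same information as its numeric counterpart. One possible remedy, sketched here on simulated data with hypothetical column names (`medu`, `fedu`, `g2`, `parents_edu`, `grade`), is to drop the derived columns before fitting:

```r
# Simulated stand-in for train_data; names and values are hypothetical.
set.seed(2)
n <- 150
toy <- data.frame(
  medu = sample(0:4, n, replace = TRUE),
  fedu = sample(0:4, n, replace = TRUE),
  g2   = sample(5:18, n, replace = TRUE)
)
toy$parents_edu <- toy$medu + toy$fedu   # derived sum, as in the report
toy$grade <- toy$g2 + rnorm(n)           # response

# Fitting on everything leaves one coefficient undefined (NA).
full_fit <- lm(grade ~ ., data = toy)
sum(is.na(coef(full_fit)))

# Dropping the derived column gives a full-rank fit.
derived <- c("parents_edu")
reduced_fit <- lm(grade ~ ., data = toy[, !(names(toy) %in% derived)])
sum(is.na(coef(reduced_fit)))
head(predict(reduced_fit, newdata = toy))
```

Applied to the report's data, the same idea would mean removing the summed and `.factor` columns from train_data and test_data before calling `lm()` and `predict()`.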
test_data$final.grade
## [1] 6 11 14 12 11 17 11 20 10 15 16 11 10 12 10 7 17 10 13 16 14 8 11 0 12
## [26] 15 11 12 13 0 16 12 15 13 9 8 10 16 10 18 16 9 7 8 4 6 17 7 9 12
## [51] 14 13 11 0 12 0 12 13 14 9 0 17 10 10 11 14 12 8 10 9 13 8 10 10 19
## [76] 14 10 14 13 14 7 14 12 12 11 11 11 9 12 16 16 12 10 11 14 12 11 12 13 18
## [101] 14 11 9 11 12 11 11 11 13 6 0 8 11 11 9 18 13 10 11 16 13 11 15 13 11
## [126] 14 16 11 16 13 14 10 8 16 13 11 10 6 12 13 14 12 12 12 10 14 12 15 17 13
## [151] 18 15 17 14 17 14 10 10 15 12 17 14 12 15 18 11 9 11 13 11 13 10 12 12 16
## [176] 10 14 10 9 13 10 18 10 15 16 9 10 11 12 10 10 9 8 10 11 8 0 18 14 0
## [201] 12 0 14 10 12 9 17 12 11
test_data$final.grade.predicted
## [1] 5.51435922 12.04248661 13.87829686 13.38305348 11.47388837 16.33415909
## [7] 12.58430917 19.37423819 8.31207301 15.00091047 16.99132618 8.60100485
## [13] 7.00453515 11.50323561 9.63939403 7.53157761 16.99715554 8.90528897
## [19] 12.23039537 15.21920287 12.84052185 6.40678958 9.21255185 0.39103735
## [25] 12.42305850 16.25008744 11.08915832 10.79067920 12.59866980 6.52945072
## [31] 14.89648738 13.06522262 14.93395628 12.37325854 7.59052521 7.17811429
## [37] 8.88240122 15.68611624 9.09766871 19.03774662 17.07287244 9.49757712
## [43] 5.81101529 6.24403188 5.07683851 5.20588229 15.86239306 7.71485884
## [49] 8.35302364 10.28869469 13.07892783 12.67485593 10.52050829 6.77793330
## [55] 12.42107443 -0.07667694 11.42403586 12.34485628 14.22914401 8.75979456
## [61] 9.22360809 17.79834397 10.13989354 8.57653330 9.96767351 14.04960534
## [67] 12.41016695 7.74524413 10.40559414 9.62553103 13.33738981 7.89813577
## [73] 9.97679192 10.07623471 18.85864501 14.11672951 10.70432290 13.94720232
## [79] 12.26530368 13.32459251 6.75939215 12.74850964 11.14044690 10.73900724
## [85] 11.71466612 10.69992549 10.90908180 9.49499387 12.35592454 15.76799881
## [91] 15.96893484 11.82671328 9.47609935 11.42461306 12.55047438 10.93227402
## [97] 11.81145284 12.19988143 13.71723402 18.19125607 14.40584575 11.08418696
## [103] 8.52645962 9.67929636 10.19445120 11.59807827 11.19033155 10.76600551
## [109] 13.17097718 7.69407365 8.83149384 6.89493058 10.26759661 10.37814801
## [115] 6.69885002 17.47108973 12.85661308 9.15631513 9.25536133 15.37096320
## [121] 12.41322511 10.08938602 14.63141807 13.67690888 9.70304345 12.50519116
## [127] 16.37874342 10.02385032 15.19371085 12.74672886 13.05858905 10.38416455
## [133] 7.64423230 16.97107188 11.62180458 11.82950832 11.53252194 7.34349680
## [139] 9.90224275 11.70102007 10.91745544 10.20374185 11.99970642 12.55170606
## [145] 9.30325886 12.65938642 12.07903337 14.94652489 17.92284489 11.83295361
## [151] 18.60033508 12.48719417 15.26370874 12.25278529 15.15195188 11.19471086
## [157] 10.61877944 8.84222069 14.57577590 10.61733111 14.09238214 13.35734987
## [163] 10.05119713 14.30080175 18.92729907 11.40754467 8.80591722 9.75220173
## [169] 13.12376768 10.33351858 10.65987986 9.45224355 12.76029133 11.03984040
## [175] 14.17272760 8.19339249 13.26617401 9.80032287 8.79922387 12.95882662
## [181] 11.30231429 17.92255635 8.75833975 14.58790446 14.59199121 8.93614246
## [187] 11.30846495 11.15881454 11.24483041 10.78926913 8.08178568 9.83693916
## [193] 8.03855245 9.08324995 8.89660104 7.27389069 7.98752644 18.45544067
## [199] 13.65115520 -0.76768672 11.47811232 0.21543678 13.08569443 9.59466610
## [205] 11.03042595 7.46453687 17.29540982 10.82380163 8.56635482
# RMSE: square root of the mean squared difference between actual and predicted values.
rmse = sqrt(mean((test_data$final.grade - test_data$final.grade.predicted)^2) )
rmse
## [1] 1.614811
# Relative error of the model: RMSE as a percentage of the mean actual final grade
std_error_rate_lm = rmse/mean(test_data$final.grade)*100
std_error_rate_lm
## [1] 13.9116
The RMSE is about 1.6, which corresponds to a relative error of roughly 14% of the mean final grade, a reasonably good fit. The multiple R-squared of the linear regression model is about 0.85 (adjusted 0.84).
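The two error measures computed above can be wrapped in small helper functions. This is a minimal sketch; the function names `compute_rmse` and `relative_rmse_pct` and the toy vectors are illustrative and not part of the original analysis.

```r
# Root mean squared error between actual and predicted values.
compute_rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# RMSE expressed as a percentage of the mean actual value.
relative_rmse_pct <- function(actual, predicted) {
  compute_rmse(actual, predicted) / mean(actual) * 100
}

# Toy example with made-up grades.
actual    <- c(10, 12, 14)
predicted <- c(11, 12, 13)
compute_rmse(actual, predicted)        # sqrt(2/3), about 0.816
relative_rmse_pct(actual, predicted)   # about 6.8
```

For the report's model, the equivalent calls would take `test_data$final.grade` and `test_data$final.grade.predicted` as arguments.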
math_grades <- mat_students %>%
gather(`first.period.grade`, `second.period.grade`, `final.grade`, key="semester", value="grade") %>%
ggplot() +
geom_bar(aes(x=grade, fill=semester), position="dodge") +
ggtitle("Distribution of three grades in Math")
por_grades <- por_students %>%
gather(`first.period.grade`, `second.period.grade`, `final.grade`, key="semester", value="grade") %>%
ggplot() +
geom_bar(aes(x=grade, fill=semester), position="dodge") +
ggtitle("Distribution of three grades in Portuguese")
grid.arrange(math_grades,por_grades)
mean(mat_students$final.grade)
## [1] 10.41519
mean(por_students$final.grade)
## [1] 11.90601
The plots above show little difference in the grade distributions of the two classes. However, more students scored 0 on the final exam in the Math class than in the Portuguese class, and the average final grade is higher in the Portuguese class (11.9) than in the Math class (10.4).
math_school_grades <- ggplot(mat_students) +
geom_bar(aes(x=school, fill=as.factor(final.grade)), position="dodge") +
ggtitle("Maths grades by school") +
theme(legend.position = "none")
port_school_grades <- ggplot(por_students) +
geom_bar(aes(x=school, fill=as.factor(final.grade)), position="dodge") +
ggtitle("Portuguese grades by school") +
theme(legend.position = "none")
grid.arrange(math_school_grades, port_school_grades)
Similar trends can be observed in both classes, but the average grades at the Gabriel Pereira (GP) school are higher than those at the Mousinho da Silveira (MS) school.
math_school_grades <- ggplot(mat_students, aes(x=final.grade)) +geom_density(aes(color=school),linetype = "dashed", size = 0.7) +
ggtitle("Maths students' grades by school")+
scale_color_manual(values = c("#868686FF", "#EFC000FF"))+
scale_fill_manual(values = c("#868686FF", "#EFC000FF"))
port_school_grades <- ggplot(por_students, aes(x=final.grade)) +
geom_density(aes(color=school),linetype = "dashed", size = 0.7) +
ggtitle("Portuguese students' grades by school")+
scale_color_manual(values = c("#868686FF", "#EFC000FF"))+
scale_fill_manual(values = c("#868686FF", "#EFC000FF"))
grid.arrange(math_school_grades, port_school_grades)
The Math class shows a similar trend in both schools, but in the Portuguese class students from the GP school outperform those from the MS school.
Students living in urban areas tend to perform better than students living in rural areas in both the Math and Portuguese classes.
mat_alcohol_level = ggplot(mat_students, aes(x=weekday.alcohol.consumption)) +
geom_density(aes(color=school),linetype = "dashed", size = 0.7) +
ggtitle("Maths students' weekday alcohol consumption by school")+
scale_color_manual(values = c("#868686FF", "#EFC000FF"))+
scale_fill_manual(values = c("#868686FF", "#EFC000FF"))
port_alcohol_level = ggplot(por_students, aes(x=weekday.alcohol.consumption)) +
geom_density(aes(color=school),linetype = "dashed", size = 0.7) +
ggtitle("Portuguese students' weekday alcohol consumption by school")+
scale_color_manual(values = c("#868686FF", "#EFC000FF"))+
scale_fill_manual(values = c("#868686FF", "#EFC000FF"))
grid.arrange(mat_alcohol_level,port_alcohol_level)
# Calculate total alcohol consumption for Math and Portuguese class
mat_students$total.alcohol.consumption = rowSums(cbind(mat_students$weekday.alcohol.consumption, mat_students$weekend.alcohol.consumption))
por_students$total.alcohol.consumption = rowSums(cbind(por_students$weekday.alcohol.consumption, por_students$weekend.alcohol.consumption))
# x-axis: reported consumption level; y-axis: number of students at that level
hist(mat_students$weekday.alcohol.consumption, main="Alcoholic Beverages Consumed on Weekdays by Math class", xlab="Weekday alcohol consumption level", border="blue", col="purple", ylab="Number of students")
hist(mat_students$weekend.alcohol.consumption, main="Alcoholic Beverages Consumed on Weekends by Math class", xlab="Weekend alcohol consumption level", border="blue", col="purple", ylab="Number of students")
hist(por_students$weekday.alcohol.consumption, main="Alcoholic Beverages Consumed on Weekdays by Portuguese class", xlab="Weekday alcohol consumption level", border="blue", col="green", ylab="Number of students")
hist(por_students$weekend.alcohol.consumption, main="Alcoholic Beverages Consumed on Weekends by Portuguese class", xlab="Weekend alcohol consumption level", border="blue", col="green", ylab="Number of students")
mean(mat_students$weekday.alcohol.consumption)
## [1] 1.481013
mean(mat_students$weekend.alcohol.consumption)
## [1] 2.291139
mean(por_students$weekday.alcohol.consumption)
## [1] 1.502311
mean(por_students$weekend.alcohol.consumption)
## [1] 2.280431
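Alcohol consumption levels are similar in the two classes, both on weekdays (means of about 1.48 for Math and 1.50 for Portuguese) and on weekends (about 2.29 and 2.28). In both classes, consumption is higher on weekends than on weekdays.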
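The four means above can also be gathered into a single table, which makes the comparison between classes easier to read. The following is a minimal sketch of one way to do that. `summarise_alcohol` is a hypothetical helper name introduced here for illustration, not part of the original analysis, and it relies only on the two consumption columns used above.

```r
# Hypothetical helper: mean weekday and weekend consumption for one class.
summarise_alcohol <- function(students, class_name) {
  data.frame(
    class        = class_name,
    weekday_mean = mean(students$weekday.alcohol.consumption, na.rm = TRUE),
    weekend_mean = mean(students$weekend.alcohol.consumption, na.rm = TRUE)
  )
}

# Assumes mat_students and por_students were loaded as shown earlier;
# the guard lets the helper be defined even when they are not available.
if (exists("mat_students") && exists("por_students")) {
  alcohol_summary <- rbind(
    summarise_alcohol(mat_students, "Math"),
    summarise_alcohol(por_students, "Portuguese")
  )
  print(alcohol_summary)
}
```

Each row of the resulting table is one class, so the weekday and weekend columns can be compared directly.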
math_class_family <- ggplot(mat_students, aes(x=final.grade)) +
geom_density(aes(color=as.factor(quality.of.family.relationships))) +
ggtitle("Maths students grades by family relationships")
port_class_family <- ggplot(por_students, aes(x=final.grade)) +
geom_density(aes(color=as.factor(quality.of.family.relationships))) +
ggtitle("Portuguese students grades by family relationships")
grid.arrange(math_class_family, port_class_family)
Surprisingly, students with the poorest family relationships had higher overall Math grades than students with stronger relationships. In the Portuguese class, on the other hand, the pattern is reversed.
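Density curves can be hard to read when some groups contain only a few students, so it may help to put numbers next to the plots. The sketch below tabulates the group size, mean and median final grade for each family-relationship level. `grade_by_family` is a hypothetical helper name, and the choice of mean and median as summaries is an assumption made here, not something taken from the original analysis.

```r
# Hypothetical helper: summarise final grades per family-relationship level.
grade_by_family <- function(students) {
  level <- students$quality.of.family.relationships
  data.frame(
    family_relationship = sort(unique(level)),            # levels in ascending order
    n_students          = as.vector(table(level)),        # group size per level
    mean_grade          = as.vector(tapply(students$final.grade, level, mean)),
    median_grade        = as.vector(tapply(students$final.grade, level, median))
  )
}

# Assumes mat_students and por_students were loaded as shown earlier.
if (exists("mat_students") && exists("por_students")) {
  print(grade_by_family(mat_students))
  print(grade_by_family(por_students))
}
```

A level with a very small `n_students` should be interpreted with caution, since a handful of students can dominate its density curve.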
Based on the data analysis, we conclude that weekday student alcohol consumption is correlated with the amount of free time a student has, the amount of time they study, how often they go out with their friends, the quality of their family relationships, and their weekend alcohol consumption. The secondary schools can use this information to create after-school programs and provide counseling services that reduce the amount of unstructured free time that can lead to alcohol consumption. Additionally, since weekday and weekend consumption are significantly associated with one another, parents can focus on ways to reduce weekday consumption by providing better guidance and better activities for students during their free time.
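As a rough check on associations of this kind, a rank correlation can be computed for any pair of columns. The sketch below uses Spearman's rank correlation through `cor.test`. This is an assumption chosen here because the consumption variables take only a few ordered values, and it is not necessarily the test used in the analysis above. `weekday_alcohol_cor` is a hypothetical helper name, and only column names that appear in this report are used.

```r
# Hypothetical helper: Spearman rank correlation between weekday
# consumption and one other numeric or ordinal column.
weekday_alcohol_cor <- function(students, other_column) {
  result <- cor.test(
    students$weekday.alcohol.consumption,
    students[[other_column]],
    method = "spearman",
    exact  = FALSE  # ties are expected on a coarse scale, so skip exact p-values
  )
  data.frame(
    variable = other_column,
    rho      = unname(result$estimate),
    p_value  = result$p.value
  )
}

# Assumes mat_students was loaded as shown earlier.
if (exists("mat_students")) {
  columns <- c("weekend.alcohol.consumption", "quality.of.family.relationships")
  print(do.call(rbind, lapply(columns, weekday_alcohol_cor, students = mat_students)))
}
```

A correlation of this kind describes association only and does not show that free time or family relationships cause changes in consumption.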
Next, based on our analysis of student performance, we conclude that past class failures, school absences, first-period grade, second-period grade, extracurricular activities, and educational support from family are statistically significant predictors of a student's final grade. This will allow the schools to focus their attention on the material that students need the most help with so that they can achieve better grades.
Lastly, comparing performance in the two classes, our analysis shows that students in the Portuguese language class obtained better grades on average than students in the Math class. However, both classes consumed about the same amount of alcohol during the week, and both showed an increase over the weekend.
In conclusion, school administrators can use this information to create more opportunities for after-school activities and to provide better tutoring services to students who fall behind at the start of the period.